Abstract

Fairness-aware learning is a novel framework for classification tasks. Like regular empirical risk
minimization (ERM), it aims to learn a classifier with a low error rate while keeping the
predictions of the classifier independent of sensitive features, such as gender, religion, race,
and ethnicity. Existing methods can achieve low dependencies on given samples, but this is not
guaranteed on unseen samples. The existing fairness-aware learning algorithms employ different
dependency measures, and each algorithm is specifically designed for a particular one. Such
diversity makes it difficult to theoretically analyze and compare them. In this paper, we propose
a general framework for fairness-aware learning that uses f-divergences and that covers most of
the dependency measures employed in the existing methods. We introduce a way to estimate the
f-divergences that allows us to give a unified analysis for the upper bound of the estimation error;
this bound is tighter than that of the existing convergence rate analysis of the divergence
estimation. With our divergence estimate, we propose a fairness-aware learning algorithm, and perform
a theoretical analysis of its generalization error. Our analysis reveals that, under mild
assumptions and even with enforcement of fairness, the generalization error of our method is
$O(\sqrt{1/n})$, which is the same as that of the regular ERM. In addition, and more importantly,
we show that, for any f-divergence, the upper bound of the estimation error of the divergence is
$O(\sqrt{1/n})$. This indicates that our fairness-aware learning algorithm guarantees low
dependencies on unseen samples for any dependency measure represented by an f-divergence.


1 Introduction 

Recently developed information systems increasingly incorporate machine learning
techniques for making important decisions, such as credit scoring, calculating insurance rates, and
evaluating employment applications. These decisions can result in unfair treatment if they
depend on sensitive information, such as the individual's gender, religion, race, or ethnicity.
Fairness-aware learning attempts to solve this problem, and has recently received a great deal of
attention [?]. In this paper, we consider the use of fairness-aware learning for classification
problems.

Let $\mathcal{X}$ and $\mathcal{Y} = \{1, \dots, c\}$ be the domain of the input and the domain of
the target, respectively. In ordinary classification algorithms, the learner aims to find a
hypothesis $f : \mathcal{X} \to \mathcal{Y}$ that minimizes misclassifications from a given set of
iid samples $S_n = \{(x_i, y_i)\}_{i=1}^n \in (\mathcal{Z} = \mathcal{X} \times \mathcal{Y})^n$.

In fairness-aware learning, we assume that the input $x_i$ contains a viewpoint $v_i \in \mathcal{V}$,
which represents the sensitive information of individuals. The learner aims to find an $f$ that will
have a low misclassification rate and for which the output of $f$ has little dependency on the
viewpoint $v$. For example, suppose a company wants to make a hiring decision using information
collected from job applicants (input $x$), including their age, place of residence, and work
experience, but also including their gender, religion, race, and ethnicity (viewpoint $v$). We wish
to make hiring decisions based on the potential work performance of the job applicants (target $y$)
via a supervised learning algorithm, $y = f(x)$. We say $f$ is discriminatory if the output of $f$
is dependent on the viewpoint $v$ [?]. Fairness-aware learning attempts to avoid such unfair
decisions by minimizing the dependency of the output of $f$ on the viewpoint $v$ [?]. Needless to
say, minimization of the misclassification rate and minimization of the dependency are conflicting
targets. Therefore, we need to consider the trade-off between misclassification and dependency.

The existing methods resolve this conflict by suppressing the dependency on sensitive
viewpoints; this is accomplished by introducing a regularization term [?] or by adding
constraints [?] to the objective function of the regular empirical risk minimization (see
Section 2). Typically, such techniques can lead to predictions for which there is less dependency on the
sensitive viewpoints of the given samples (empirical dependency). However, predictors with low
empirical dependency do not necessarily achieve low dependency on sensitive viewpoints of unseen
samples (generalization dependency). In the hiring decision example, the hypothesis $f$ is trained
with information collected from the past histories of job applicants. Predictors trained with existing
methods might make fair decisions for the job applicants in the past (low empirical dependency).
However, fair decisions for the job applicants in the future (low generalization dependency) are not
guaranteed. Except for the method of Fukuchi and Sakuma [?], most of the existing methods have
no theoretical guarantee of the generalization dependency. The analysis of Fukuchi and Sakuma [?]
provides a probabilistic bound on the generalization dependency, but it is derived for only a
specific measure of dependency.

Our contributions. 

We perform a unified analysis of fairness-aware learning with more general dependency
measures based on the f-divergence [?]. The f-divergence is a universal class of divergences,
which can represent most of the existing divergences, including the total variational distance, the
covariance, the Hellinger distance, the $\chi^2$-divergence, and the KL-divergence. Our
fairness-aware learning basically follows the framework of empirical risk minimization (ERM). The
goal of fairness-aware learning is to obtain predictors with an upper bound guarantee on the
generalization dependency; however, the generalization dependency cannot be directly evaluated
because the underlying distribution is not observable. We thus derive an upper bound of the
generalization dependency given by the empirical dependency plus two extra terms. Our framework
achieves the fairness of the resultant predictors by restricting the class of hypotheses to those
with low empirical dependency. Thus, an upper bound of the generalization dependency of the
predictors can be theoretically derived by using this bound.

The contributions of this study are two-fold. First, we propose a novel generalized procedure for
estimating the f-divergences for fairness-aware learning. Our estimation method can be regarded
as a generalization of [?]. As already stated, we constrain the hypothesis class by the
f-divergence to guarantee fairness. It is thus important to derive a tighter upper bound of the
f-divergence to achieve lower generalization dependency. An existing divergence estimation
method [?] provides an upper bound of the f-divergence; however, the bound is not suitable for our
purpose for the following two reasons. First, the analysis is specifically derived for the
KL-divergence and cannot be extended to general f-divergences. Second, the bound is derived for
convergence analysis, not as an upper bound of the divergence. Thus, the bound is loose for our
purpose. Our generalized estimation procedure provides a tighter upper bound of the f-divergences
by introducing the maximum mean discrepancy. As a result, the estimation error of the f-divergence
is bounded above by the empirical maximum mean discrepancy and by $O(\sqrt{1/n})$.

Second, we formulate a general ERM framework for fairness-aware learning that employs the
f-divergence. We analyze the generalization error and generalization dependency of the proposed
fairness-aware learning algorithm, and we show that even when fairness is enforced, the
generalization error can be bounded above by the Rademacher complexity and $O(\sqrt{1/n})$, as in
the regular ERM. The generalization dependency can be bounded above by the empirical maximum mean
discrepancy term and two other extra terms. Thanks to the theoretical analysis of generalization
dependency, we can theoretically compare the upper bounds of the estimation error across dependency
measures. Our analysis reveals that the divergence estimation errors for all of these divergences
are equally $O(\sqrt{1/n})$, and that the Hellinger distance achieves the lowest estimation error in
terms of the constant term of the probabilistic error. We also derive a convex formulation of
fairness-aware learning that works with any dependency measure represented by the f-divergence. The
optimization problem can be readily solved by a standard convex optimization solver.


2 Related Works 


Within the setup described in the Introduction, Calders and Verwer [?] pointed out that elimination
of the viewpoint from the given samples is insufficient for achieving low correlation between the
output of $f$ and the viewpoint; this is because the viewpoint has an indirect influence since it is
not independent from the input. For example, when we make a hiring decision using information
collected from job applicants via supervised learning, even if we train with samples that exclude
race and ethnicity, the output of the resultant hypothesis $f$ may be indirectly correlated with race
or ethnicity, because the addresses of the applicants may be correlated with their race or ethnicity.
Such an indirect effect is called the red-lining effect [?].

To remove the red-lining effect, existing works have attempted to construct a classifier that
results in a fairer hypothesis. Calders and Verwer [?] proposed the naive Bayes classifier
with a fairness constraint, which employs the difference between the conditional probabilities
$|\Pr(f(X) = y_+ \mid v_+) - \Pr(f(X) = y_+ \mid v_-)|$, where $\mathcal{Y} = \{y_+, y_-\}$ and
$\mathcal{V} = \{v_+, v_-\}$. Kamiran et al. [?] and Žliobaitė et al. [?] discussed various
situations in which discrimination can occur, in terms of the difference of conditional
probabilities. Dwork et al. [?] introduced a fairness-aware learning framework of the ERM with
constraints of statistical parity, defined as the total variational distance between
$\Pr(f(X) \mid v_+)$ and $\Pr(f(X) \mid v_-)$. Zemel et al. [?] presented an algorithm to
preserve fairness in a classification setting based on statistical parity. Kamishima et al. [?]
proposed a fairness-aware learning algorithm of the maximum likelihood estimation by penalizing the
log-likelihood using the KL-divergence between $\Pr(f(X), V)$ and $\Pr(f(X))\Pr(V)$.^1 These
fairness-aware learning algorithms do not have a theoretical guarantee for the estimation error of
the dependency measures. In addition, the design of these algorithms is tightly coupled with
specific dependency measures. They thus have less flexibility for choosing other dependency measures.

The fairness for unseen samples can be measured by the estimation error bound of the dependency
measure. Fukuchi and Sakuma [?] first derived a bound on the estimation error of a specific measure,
namely the +1/-1 neutrality risk. They proved that the estimation error of the measure is bounded
above in probability by the Rademacher complexity of the hypothesis class $\mathcal{F}$ and an
$O(\sqrt{1/n})$ term. Unfortunately, the analysis relies on the +1/-1 neutrality risk and cannot be
generalized to other types of dependency measures.

Estimation procedures for f-divergences that are based on iid samples have been studied
extensively. For example, for the KL-divergence, methods have been proposed that use nearest-neighbor
distances [?] and least-squares estimations of the probability ratio [?]. To estimate f-divergences,
García-García et al. [?] introduced an estimation procedure that uses loss minimization and
sampling. Kanamori et al. [?] presented an estimator of the f-divergences based on using
the moment matching estimator [?] to estimate the probability ratio. Nguyen et al. [?] used a
property of convex conjugate functions to derive the M-estimator of f-divergences, and they also
derived its convergence rate. In our analysis, we derive the upper bound of the estimation error,
which yields a tighter upper bound of the estimation error of dependency measures compared to the
existing convergence rate analysis.

^1 The KL divergence between $\Pr(f(X), V)$ and $\Pr(f(X))\Pr(V)$ is known as the mutual information
between $f(X)$ and $V$.
3 Problem Formulation


Let $\mathcal{X}$ and $\mathcal{Y} = \{1, \dots, c\}$ be the domain of the input and the domain of
the target, respectively. We assume that the learner obtains a set of iid samples
$S_n = \{(x_i, y_i)\}_{i=1}^n \in (\mathcal{Z} = \mathcal{X} \times \mathcal{Y})^n$ that
are drawn from an unknown probability measure $\mu$, which is defined on some measurable space
$(\mathcal{Z}, \mathfrak{Z})$. In addition, we assume the input $x_i$ consists of the viewpoint
$v_i \in \mathcal{V}$ and various other features $w_i \in \mathcal{W}$. Thus,
$\mathcal{X} = \mathcal{V} \times \mathcal{W}$. Given iid samples, the learner seeks to find a
hypothesis $f : \mathcal{X} \to \mathcal{Y}$ from a class of measurable functions $\mathcal{F}$ that
minimizes both the misclassification rate and the dependence on the viewpoint $v$. We denote
$(X_i, Y_i)$ for $i = 1, \dots, n$ as the random variables of the samples, and we denote $V_i$ as
the random variable for the corresponding viewpoint.

The misclassification of the hypothesis $f$ is evaluated by the generalization risk, which is
defined as $R(f) = \mathbb{E}[\ell(Y, f(X))]$. The goal of the learner is to find the hypothesis
$f^* \in \mathcal{F}$ such that

$$R(f^*) = \inf_{f \in \mathcal{F}} R(f).$$

The generalization risk $R(f)$ cannot be evaluated directly because the sample distribution $\mu$ is
unknown. Instead of the generalization risk, empirical risk minimization (ERM) finds a hypothesis
$f_n \in \mathcal{F}$ that minimizes the empirical risk

$$R_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i)).$$

Minimization of the empirical risk results in a relatively low generalization risk, and the
generalization risk of the resultant hypothesis converges towards that of the optimal hypothesis
$f^*$ as the number of samples increases; this has been shown theoretically [?].


3.1 A Generalized Class for Dependency Measures 

For the evaluation of the dependency of the output of $f$ on the viewpoint $v$, we define a general
class of measures for dependency. Given that $\Pr(V)\Pr(f(X)) = \Pr(V, f(X))$ if $f(X)$ and $V$ are
statistically independent, we can evaluate this dependency by evaluating the difference between the
two probability measures, $\Pr(V)\Pr(f(X))$ and $\Pr(V, f(X))$. To measure the difference between
two probability measures, we use the f-divergences. Suppose $P$ and $Q$ are two probability measures
on a compact domain $\mathcal{X}$, where $P$ is absolutely continuous with respect to $Q$. The class
of f-divergences, also known as the Ali-Silvey distances [?], takes the form

$$D_\phi(P, Q) = \int \phi\left(\frac{dP}{dQ}\right) dQ,$$

where $\phi : \mathbb{R}_+ \to \mathbb{R}$ is a convex and lower semicontinuous function such that
$\phi(1) = 0$.^2 Having defined the f-divergences, we define the measure of the dependency between
$f(X)$ and $V$ as follows:

$$D_\phi(f) = D_\phi(\Pr(V)\Pr(f(X)), \Pr(V, f(X))).$$

Without loss of generality, we will assume that the subdifferential of $\phi$ at 1 contains 0. This
can be readily confirmed by considering $\phi_c(u) = \phi(u) - c(u - 1)$, which does not change the
value of the f-divergences, that is, $D_{\phi_c}(P, Q) = D_\phi(P, Q)$ for any finite $c$. We will
focus on the convex functions $\phi$ that are differentiable on $\mathbb{R}_+$ except at 1. Note
that this includes most of the divergences, including the total variational distance, the Hellinger
distance, the $\chi^2$-divergence, and the KL-divergence.
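
The invariance used above can be checked in a few lines. The derivation below is a sketch that assumes the integral form $D_\phi(P, Q) = \int \phi(dP/dQ)\,dQ$ of the f-divergence, which agrees with the later identity $D_\phi(f) = \mathbb{E}[\phi(r(V, f(X)))]$; the symbol $\phi_c$ is the shifted generator from the paragraph above.

```latex
\begin{align*}
D_{\phi_c}(P, Q)
  &= \int \left[ \phi\!\left(\frac{dP}{dQ}\right)
       - c\left(\frac{dP}{dQ} - 1\right) \right] dQ \\
  &= D_\phi(P, Q) - c\left( \int dP - \int dQ \right) \\
  &= D_\phi(P, Q) - c\,(1 - 1) = D_\phi(P, Q).
\end{align*}
% The subdifferential shifts by the same constant:
% \partial\phi_c(1) = \partial\phi(1) - c,
% so choosing any c \in \partial\phi(1) gives 0 \in \partial\phi_c(1).
```

Since both $P$ and $Q$ integrate to one, the linear correction vanishes, which is why the assumption on the subdifferential costs no generality.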

^2 The f-divergence becomes one of the existing divergences depending on the choice of $\phi$; that
is, it becomes the total variational distance if $\phi(u) = |u - 1|$, the Hellinger distance if
$\phi(u) = (\sqrt{u} - 1)^2$, the $\chi^2$-divergence if $\phi(u) = (u - 1)^2/u$, or the
KL-divergence if $\phi(u) = (u - 1) - \ln(u)$. See Figure 1.
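
The dependency measure $D_\phi(f)$ can be illustrated on a small discrete example. The following is a minimal sketch, assuming a known joint probability table for $(V, f(X))$ (in practice the distribution is unknown and must be estimated). The names `PHI`, `f_divergence`, and `dependency` are hypothetical, and the $\chi^2$ generator is taken to be $(u - 1)^2/u$, which is one reading of the footnote rather than a confirmed formula.

```python
import math

# Generator functions phi; each satisfies phi(1) = 0.
# The chi-square form (u - 1)^2 / u is an assumption about the footnote.
PHI = {
    "total_variation": lambda u: abs(u - 1.0),
    "hellinger": lambda u: (math.sqrt(u) - 1.0) ** 2,
    "chi_square": lambda u: (u - 1.0) ** 2 / u,
    "kl": lambda u: (u - 1.0) - math.log(u),
}


def f_divergence(phi, p, q):
    """D_phi(P, Q) = sum_z q(z) * phi(p(z) / q(z)) for discrete P, Q with p, q > 0."""
    return sum(qz * phi(pz / qz) for pz, qz in zip(p, q))


def dependency(phi, joint):
    """D_phi(Pr(V)Pr(f(X)), Pr(V, f(X))) for a joint table joint[v][y]."""
    pv = [sum(row) for row in joint]          # marginal of the viewpoint
    py = [sum(col) for col in zip(*joint)]    # marginal of the prediction
    cells = [(i, j) for i in range(len(pv)) for j in range(len(py))]
    product = [pv[i] * py[j] for i, j in cells]
    flat = [joint[i][j] for i, j in cells]
    return f_divergence(phi, product, flat)


independent = [[0.25, 0.25], [0.25, 0.25]]  # prediction ignores the viewpoint
dependent = [[0.4, 0.1], [0.1, 0.4]]        # prediction tracks the viewpoint

for name, phi in PHI.items():
    print(name, dependency(phi, independent), dependency(phi, dependent))
```

Every measure is zero on the independent table and strictly positive on the dependent one, which is the property that makes any of them usable as a fairness constraint.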





3.2 Fairness-Aware Learning with Generalized Dependency Measures 

In fairness-aware learning, the learner attempts to minimize both $R(f)$ and $D_\phi(f)$. However,
since $\arg\min_{f \in \mathcal{F}} R(f) = \arg\min_{f \in \mathcal{F}} D_\phi(f)$ does not always
hold, there exists a trade-off between $R(f)$ and $D_\phi(f)$. We thus consider a subset of
$\mathcal{F}$ parameterized by $\eta > 0$, defined as follows:

$$\mathcal{F}_\eta = \{f \in \mathcal{F} : D_\phi(f) \le \eta\}.$$

Thus, the goal of fairness-aware learning is to find the hypothesis
$f^*_\eta \in \mathcal{F}_\eta$ that satisfies

$$R(f^*_\eta) = \inf_{f \in \mathcal{F}_\eta} R(f).$$

Again, since the generalization risk cannot be evaluated directly, the learner minimizes the empirical
risk as

$$\min_{f \in \mathcal{F}_\eta} R_n(f). \tag{1}$$

The objective of fairness-aware learning is to solve the optimization problem of Eq. (1).
Unfortunately, $D_\phi(f)$ cannot be evaluated directly either, since the underlying distribution is
unobservable.

In Section 4, we introduce a novel estimation procedure for the f-divergences to make the evaluation
of $D_\phi(f)$ tractable. Then, we prove an upper bound of $D_\phi(f)$ in terms of an empirical
estimate of $D_\phi(f)$ given a finite number of samples. In Section 5, the objective function of
fairness-aware learning is redefined using the empirical estimate of $D_\phi(f)$.
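
To make the constrained problem in Eq. (1) concrete, the sketch below selects, from a small finite set of hypotheses, the one with the lowest empirical risk among those whose dependency is at most $\eta$. It is an illustration under strong assumptions: the viewpoint and the prediction are binary, the dependency is a plug-in total variational estimate computed from sample frequencies (not the MMD-based estimator developed in this paper), and all names and data are hypothetical.

```python
def plug_in_tv(vs, preds):
    """Plug-in estimate of sum_{a,b} |P(v=a) P(yhat=b) - P(v=a, yhat=b)| for binary values."""
    n = len(vs)
    total = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_a = sum(1 for v in vs if v == a) / n
            p_b = sum(1 for p in preds if p == b) / n
            p_ab = sum(1 for v, p in zip(vs, preds) if v == a and p == b) / n
            total += abs(p_a * p_b - p_ab)
    return total


def constrained_erm(hypotheses, xs, vs, ys, eta):
    """Return (name, empirical risk, dependency) of the best hypothesis with dependency <= eta."""
    best = None
    for name, f in hypotheses.items():
        preds = [f(x, v) for x, v in zip(xs, vs)]
        risk = sum(1 for p, y in zip(preds, ys) if p != y) / len(ys)  # 0-1 loss
        dep = plug_in_tv(vs, preds)
        if dep <= eta and (best is None or risk < best[1]):
            best = (name, risk, dep)
    return best


# Toy data: x is a score, v the binary viewpoint, y the target.
xs = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3, 0.6, 0.4]
vs = [1, 0, 1, 0, 1, 0, 0, 1]
ys = [1, 1, 1, 0, 1, 0, 0, 1]

hypotheses = {
    "uses_viewpoint": lambda x, v: v,
    "score_threshold": lambda x, v: 1 if x > 0.5 else 0,
    "always_reject": lambda x, v: 0,
}

print(constrained_erm(hypotheses, xs, vs, ys, eta=0.1))
print(constrained_erm(hypotheses, xs, vs, ys, eta=2.0))
```

With a loose constraint the hypothesis that simply copies the viewpoint wins on empirical risk; with a tight constraint it is excluded and a less accurate but viewpoint-independent hypothesis is selected, which is the trade-off described above.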


4 Divergence Estimation 

In this section, we introduce a procedure that involves minimizing the maximum mean
discrepancy (MMD) for estimating $D_\phi(f)$, and we determine a non-asymptotic bound on the
estimation error. This procedure covers the existing f-divergence or KL-divergence estimation
algorithms proposed by Nguyen et al. [?], Ruderman et al. [?], and Kanamori et al. [?].


4.1 Estimation of the Divergence by Minimizing the Maximum Mean Discrepancy


To estimate $D_\phi(f)$, we first empirically estimate the probability ratio
$r(V, f(X)) = d\Pr(V)\Pr(f(X)) / d\Pr(V, f(X))$, and then we empirically evaluate $D_\phi(f)$ by
using the estimated probability ratio. Since $d\Pr(V)\Pr(f(X)) = r(V, f(X))\,d\Pr(V, f(X))$ holds
for the probability ratio $r$, the minimizer of the difference between $d\Pr(V)\Pr(f(X))$ and
$r(V, f(X))\,d\Pr(V, f(X))$ is expected to be close to the probability ratio. As a measure of the
disparity between $d\Pr(V)\Pr(f(X))$ and $r(V, f(X))\,d\Pr(V, f(X))$, we use the maximum mean
discrepancy. Let $\mathcal{G}$ be a set of functions
$g : \mathcal{V} \times \mathcal{Y} \to \mathbb{R}$. Let $X'$ be an independent copy of $X$, and let
$V'$ be the viewpoint of $X'$. Then, the MMD with $\mathcal{G}$ between $d\Pr(V)\Pr(f(X))$ and
$r(V, f(X))\,d\Pr(V, f(X))$ is defined as

$$D_{\mathrm{MMD},f}(r) = \sup_{g \in \mathcal{G}} \left[ \int g(V, f(X))\, d(\Pr(V)\Pr(f(X))) - \int g(V, f(X))\, r(V, f(X))\, d\Pr(V, f(X)) \right]
= \sup_{g \in \mathcal{G}} \left[ \mathbb{E}[g(V', f(X))] - \mathbb{E}[r(V, f(X))\, g(V, f(X))] \right]. \tag{2}$$


If $r$ is equal to the probability ratio, we have $D_{\mathrm{MMD},f}(r) = 0$. However, the converse
implication, $D_{\mathrm{MMD},f}(r) = 0 \Rightarrow d\Pr(V)\Pr(f(X)) = r(V, f(X))\,d\Pr(V, f(X))$,
does not always hold; it requires that $\mathcal{G}$ be a set of functions induced by a universal
kernel [?]. Therefore, the ability of the MMD to detect the discrepancy depends on the choice of
$\mathcal{G}$. The U-statistics [?] gives an unbiased estimator of Eq. (2) as

$$D_{\mathrm{MMD},f,n}(r) = \sup_{g \in \mathcal{G}} \frac{1}{n(n-1)} \sum_{1 \le i \ne j \le n} \left[ g(V_i, f(X_j)) - r(V_i, f(X_i))\, g(V_i, f(X_i)) \right].$$


The estimator of the probability ratio is obtained by minimizing $D_{\mathrm{MMD},f,n}(r)$.^3 We can
add the regularizer term $\Omega(r)$ to the empirical MMD to ensure the consistency of the estimator:

$$\min_{r_n \ge 0} D_{\mathrm{MMD},f,n}(r_n) + \Omega(r_n). \tag{3}$$

After obtaining $r_n$ by solving Eq. (3), the f-divergence is empirically evaluated as

$$D_{\phi,n}(f) = \frac{1}{n} \sum_{i=1}^n \phi(r_n(V_i, f(X_i))).$$

The estimation procedure is equivalent to that of [?] if $\Omega(r) = \lambda_n D_{\phi,n}(f)$,
where $\lambda_n$ is a regularization parameter. If, in addition to the regularizer term, we add the
constraint $\frac{1}{n} \sum_{i=1}^n r_n(V_i, f(X_i)) = 1$ to Eq. (3), the estimation procedure
becomes the same as that of [?]. Letting $\Omega(r) = 0$ with an appropriate choice of $\mathcal{G}$
yields the estimation procedure of [?].


4.2 Analysis of Estimation Error 

In this subsection, we show the upper bound on the estimation error

$$D_\phi(f) - D_{\phi,n}(f).$$

Surprisingly, the upper bound of the estimation error does not depend on the complexity of the
class of functions $\mathcal{G}$. In what follows, we denote the true probability ratio as
$r^*(V, f(X))$, and we use $D_\phi(f) = \mathbb{E}[\phi(r^*(V, f(X)))]$ and
$D_{\phi,n}(f) = \sum_{i=1}^n \phi(r_n(V_i, f(X_i)))/n$.

The following theorem states the probabilistic upper bound on the estimation error. 

Theorem 1. Let $r_n$ be the probability ratio estimated from the obtained set of samples $S_n$.
Suppose that the class of functions $\mathcal{G}$ of the MMD contains
$\partial\phi(r^*(V, f(X)))/a$, where $\partial\phi$ is an element of the subdifferential of $\phi$,
$a > 0$ is some constant, and $c_\ell \le r^*(V, f(X)) \le c_u$ almost surely, where
$c_\ell \in (0, 1]$ and $c_u \in [1, \infty)$. Then, with probability at least $1 - e^{-t}$,

$$D_\phi(f) \le D_{\phi,n}(f) + a D_{\mathrm{MMD},f,n}(r_n) + c\sqrt{2t/n},$$

where $c = 2\max\{\phi(c_\ell), \phi(c_u)\} + \partial\phi(c_u) c_u + \partial\phi(c_u) - 2\partial\phi(c_\ell)$.

The proof of this theorem is found in Appendix C. As proved in Theorem 1, the f-divergences can be
bounded above by the empirical f-divergences, the empirical MMD, and $O(\sqrt{1/n})$. We minimize
the error between the f-divergences and the empirical f-divergences by minimizing the empirical
MMD. In addition, the error bound does not depend on the complexity of $\mathcal{G}$. This implies
that in order to guarantee the upper bound on the f-divergences, we should choose $\mathcal{G}$ so
that it is large enough to satisfy $\partial\phi(r^*(V, f(X)))/a \in \mathcal{G}$. A large
$\mathcal{G}$, however, can lead to an overestimation of the f-divergences.

The convergence rate, i.e., the absolute value of the estimation error, as shown by [?], depends
on the convergence rate of the empirical process with respect to $\mathcal{G}$. However, the upper
bound proved by Theorem 1 does not contain a complexity term of $\mathcal{G}$, such as the
Rademacher complexity, the covering entropy, or the bracketing entropy, and thus is tighter than the
convergence rate.

^3 The efficient computation of the empirical MMD is shown in Appendix A.

Figure 1: The shape of $\phi$ and the value of $c$ in Theorem 1 for various $\phi$, where
$c_\ell = e^{-t}$ and $c_u = e^{t}$.


5 Fairness-Aware Learning with a Divergence Estimation 


In this section, we provide an algorithm for solving Eq. (1) that includes the introduced estimation
procedure for the f-divergences. We will then show that the algorithm can be formulated as a convex
optimization problem.


5.1 Algorithm for Fairness-Aware Learning with f-Divergence Estimation

Following the estimation procedure described in Section 4, we define the optimization problem of
our fairness-aware learning as

$$\min_{f \in \mathcal{F},\, r_n \ge 0} R_n(f) \quad \text{sub to} \quad D_{\phi,n}(f) + a_n D_{\mathrm{MMD},f,n}(r_n) \le \eta, \tag{4}$$

where $a_n$ is a constant larger than the $a$ that was defined in Theorem 1. As indicated by
Theorem 1, $D_\phi(f) \le D_{\phi,n}(f) + a_n D_{\mathrm{MMD},f,n}(r_n) + O(\sqrt{1/n})$ holds,
which guarantees that the f-divergence of the resultant hypothesis of Eq. (4) is less than
$\eta + O(\sqrt{1/n})$.

Let us consider the effect of the choice of $\phi$ on the estimation error of the divergence. The
$O(\sqrt{1/n})$ rate of the upper bound shown in Theorem 1 does not depend on the choice of $\phi$.
Nevertheless, the choice of $\phi$ changes the constant $c$. Letting $c_\ell = e^{-t}$ and
$c_u = e^{t}$, Figure 1 shows the shape of $\phi$ and the value of $c$ corresponding to $t$ for
various functions $\phi$. As shown in Figure 1, the smallest $c$ is that for the Hellinger distance,
and thus of these four divergences, it has the tightest bound for a given range of the probability
ratio $r^*$.
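
The comparison of constants can be reproduced numerically. The sketch below assumes that the constant in Theorem 1 has the form $c = 2\max\{\phi(c_\ell), \phi(c_u)\} + \partial\phi(c_u) c_u + \partial\phi(c_u) - 2\partial\phi(c_\ell)$, which is the sum of the ranges appearing in Lemmas 1 and 2 of Appendix C, and that the $\chi^2$ generator is $(u - 1)^2/u$; both are readings of the text rather than confirmed formulas. The names `DIVERGENCES` and `bound_constant` are hypothetical.

```python
import math

# name: (phi, derivative of phi away from u = 1)
DIVERGENCES = {
    "total_variation": (
        lambda u: abs(u - 1.0),
        lambda u: math.copysign(1.0, u - 1.0),
    ),
    "hellinger": (
        lambda u: (math.sqrt(u) - 1.0) ** 2,
        lambda u: 1.0 - 1.0 / math.sqrt(u),
    ),
    "chi_square": (
        lambda u: (u - 1.0) ** 2 / u,   # assumed form of the generator
        lambda u: 1.0 - 1.0 / u ** 2,
    ),
    "kl": (
        lambda u: (u - 1.0) - math.log(u),
        lambda u: 1.0 - 1.0 / u,
    ),
}


def bound_constant(phi, dphi, c_l, c_u):
    """Assumed constant c of Theorem 1 for a probability ratio bounded in [c_l, c_u]."""
    return (2.0 * max(phi(c_l), phi(c_u))
            + dphi(c_u) * c_u + dphi(c_u) - 2.0 * dphi(c_l))


t = 1.0  # the ratio is assumed to lie in [e^{-t}, e^{t}]
constants = {
    name: bound_constant(phi, dphi, math.exp(-t), math.exp(t))
    for name, (phi, dphi) in DIVERGENCES.items()
}
for name, value in sorted(constants.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.3f}")
```

Under these assumptions the Hellinger distance yields the smallest constant at $t = 1$, in line with the observation drawn from Figure 1.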


5.2 Optimization 

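The necessary condition for convexity of Eq. (4) is the linearity of the functions
$g \in \mathcal{G}$ with respect to $f$. With mild assumptions, the problem can be made convex for
any choice of $\mathcal{G}$ by a simple reformulation.

Assumption 1. The hypothesis $f \in \mathcal{F}$ is formed as
$f(x) = \arg\max_{y \in \mathcal{Y}} \theta(x, y)$, and $\theta \in \Theta$ is linear with respect
to the parameters.

With this assumption, the function $\theta$ on the RKHS $\mathcal{H}$ is given as
$\theta(x, y) = \langle \Psi(x, y), w \rangle_{\mathcal{H}}$, where
$\Psi : \mathcal{X} \times \mathcal{Y} \to \mathcal{H}$ and $w \in \mathcal{H}$. The optimization
problem in Eq. (4) can be rearranged as

$$\min_{f \in \mathcal{F}} R_n(f) \quad \text{sub to} \quad \min_{r_n \ge 0} \left( D_{\phi,n}(f) + a_n D_{\mathrm{MMD},f,n}(r_n) \right) \le \eta. \tag{5}$$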
Since $r_n$ does not appear in the objective function of the original optimization problem in
Eq. (4), we change the optimization problem so that the optimization with respect to $r_n$ appears
only in the constraint. Following the derivation of the dual problem in [?], we have the dual form
of the constraint as

$$\min_{r_n \ge 0} \left( D_{\phi,n}(f) + a_n D_{\mathrm{MMD},f,n}(r_n) \right) = \max_{g \in \mathcal{G}} E_{\phi,n}(g),$$

where 


E^^n{g) 


1 

n(n — 1) 




n 


1 

2,(Xn 


Ml 


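and $\phi^*(v) = \sup_u (uv - \phi(u))$ is the convex conjugate of $\phi$. Introducing variables
$\gamma_{ij}$, we can rewrite the optimization problem in Eq. (5) as

$$\min_{f \in \mathcal{F}} R_n(f) \quad \text{sub to} \quad I[f(x_i) = j] = \gamma_{ij} \ \forall i, j, \quad \max_{g \in \mathcal{G}} E_{\phi,n}(\gamma, g) \le \eta, \tag{6}$$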

where 

^ c ^ n c ^ 

E^,nh^g)= ( XI +—\\g\\l^. 

-^7 1 1 ^-171 

l<2^j<n/c=l i—lk —1 

From the definition of $f$, we have
$I[f(x_i) = j] = I[\max_{k \ne j} \theta(x_i, k) - \theta(x_i, j) < 0]$. Let
$\Delta_c = \{x \in \mathbb{R}^c \mid x_i \ge 0 \ \forall i, \ \sum_{i=1}^c x_i = 1\}$. Then, we
relax the indicator function as follows:

$$\min_{f \in \mathcal{F},\, \gamma_i \in \Delta_c} R_n(f) \quad \text{sub to} \quad \max_{k \ne j} \theta(x_i, k) - \theta(x_i, j) \le -\gamma_{ij}, \quad \max_{g \in \mathcal{G}} E_{\phi,n}(\gamma, g) \le \eta. \tag{7}$$

This optimization problem is convex, and its solution is equivalent to that of the original,
unrelaxed problem. We prove these claims by the following corollary and theorem.

Corollary 1. If Assumption 1 holds, and $\ell$ is convex with respect to $\theta \in \Theta$, the
optimization problem in Eq. (7) is a convex optimization problem.

Theorem 2. The solution of the optimization problem in Eq. (7) is equivalent to the solution of
the original, unrelaxed problem.

The proofs of the corollary and the theorem can be found in Appendix C.


6 Generalization Error Analysis 

We consider the generalization error bound of the learned hypothesis $f_n \in \mathcal{F}$ that is
obtained by the algorithm described in Section 5. In our analysis, we use two types of Rademacher
complexity, which measure the complexity of a class of functions
$f : \mathcal{Z} \to \mathbb{R}$ and are defined as


= E 

1 "■ 

sup - Y] (Jif{Zi) 

, 7^f^(.F)=E 

1 

sup — 

n 


L/6^ ^ ^ J 





where $(\sigma_i)$ are independent Rademacher variables, that is,
$\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2$.
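
For intuition, a Rademacher complexity can be approximated by Monte Carlo sampling of the sign variables. The sketch below uses the standard empirical form $\mathbb{E}_\sigma[\sup_{f} \frac{1}{n} \sum_i \sigma_i f(z_i)]$ for a finite function class evaluated on a fixed sample; this standard form is an assumption here, since the two definitions used in this section may differ from it in detail. The function name and the toy values are hypothetical.

```python
import random


def empirical_rademacher(function_values, trials=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(z_i) ].

    function_values[k][i] is the value of the k-th function on the i-th sample.
    """
    rng = random.Random(seed)
    n = len(function_values[0])
    total = 0.0
    for _ in range(trials):
        # Draw one vector of independent Rademacher signs.
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        # Supremum over the (finite) class of the signed empirical mean.
        total += max(
            sum(s * fz for s, fz in zip(sigma, values)) / n
            for values in function_values
        )
    return total / trials


full_class = [
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
]
restricted_class = full_class[:1]  # a subset, e.g. after a fairness restriction

print(empirical_rademacher(restricted_class), empirical_rademacher(full_class))
```

Because the same seed produces the same sign vectors for both calls, the estimate for the subset can never exceed the estimate for the full class, mirroring the fact that restricting the hypothesis class cannot increase its complexity.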

In the generalization error analysis, since our fairness-aware learning algorithm has the
probabilistic error $c\sqrt{2\tau/n}$, we consider the set of hypotheses defined as

$$\mathcal{F}_\tau = \{f \in \mathcal{F} \mid D_\phi(f) \le \eta + c\sqrt{2\tau/n}\},$$

where $c$ is defined as in Theorem 1. Theorem 1 shows that with probability at least
$1 - e^{-\tau}$, $f_n \in \mathcal{F}_\tau$. Hence, application of the theorem in [?], which appears
in Appendix B, yields the generalization error bound for our algorithm.
Corollary 2. Let $f^*$ be a hypothesis such that $R(f^*) = \inf_{f \in \mathcal{F}_\tau} R(f)$, let
$\mathcal{H}_\tau = \{h : \mathcal{Y} \times \mathcal{X} \to \mathbb{R} \mid h(Y, X) = \ell(Y, f(X)) - \ell(Y, f^*(X)),\ f \in \mathcal{F}_\tau\}$,
and let $f_n \in \mathcal{F}$ be a hypothesis learned from the obtained set of samples $S_n$.
Suppose that $h(y, x) - h(y', x') \le c$ for any $h \in \mathcal{H}_\tau$,
$y, y' \in \mathcal{Y}$, and $x, x' \in \mathcal{X}$. Then, with probability at least
$1 - e^{-t} - e^{-\tau}$,



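Since $\mathcal{H}_\tau \subseteq \mathcal{H}$, we have
$\mathcal{R}_n(\mathcal{H}_\tau) \le \mathcal{R}_n(\mathcal{H})$ and
$\sigma^2_{\mathcal{H}_\tau} \le \sigma^2_{\mathcal{H}}$. Therefore, the generalization error bound
of the algorithm constrained by the f-divergences is no larger than that of the algorithm without
the constraint.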

While our algorithm guarantees an upper bound on the f-divergences, it reduces the classification
performance, as compared to the classifier learned by ERM. Accordingly, let us consider the
difference between the generalization risks of the optimal hypotheses with and without the
restriction on the f-divergences:

$$\inf_{f \in \mathcal{F}_\tau} R(f) - \inf_{f \in \mathcal{F}} R(f). \tag{8}$$

This error represents the reduction in the classification performance caused by restricting the
f-divergences. Since the error cannot be directly evaluated, we define the estimator of Eq. (8)
as

$$R_n(f_n) - \inf_{f \in \mathcal{F}} R_n(f).$$

Our interest is to derive the convergence rate of this estimator. We denote
$\ell \circ \mathcal{F}' = \{h : \mathcal{Y} \times \mathcal{X} \to \mathbb{R} \mid h(Y, X) = \ell(Y, f(X)),\ f \in \mathcal{F}'\}$
for any $\mathcal{F}' \subseteq \mathcal{F}$; the following theorem then shows the
convergence rate of the estimator.
Theorem 3. Suppose that $\ell(y, f(x)) - \ell(y', f(x')) \le c$ for any $f \in \mathcal{F}$ and
$(y, x), (y', x') \in \mathcal{Z}$. Then, with probability at least $1 - 2e^{-t} - e^{-\tau}$,



The proof of this theorem appears in Appendix C.

7 Conclusions 

In this paper, we considered fairness-aware learning for a classification problem, with the aim of
learning a classifier that returns predictions with the lowest misclassification rate and the lowest
dependence on the viewpoint. Our contributions are as follows: (1) We propose a novel generalized
procedure for estimating the f-divergences for fairness-aware learning. Our generalized estimation
procedure provides a tighter upper bound of the estimation error by introducing the maximum mean
discrepancy. (2) We formulate a general ERM framework for a fairness-aware learning algorithm that
is based on the empirical estimation procedure of the f-divergences, and that can guarantee an upper
bound on the generalization dependency. Furthermore, we provide an analysis of the generalization
error of the proposed fairness-aware learning algorithm.
A Maximum Mean Discrepancy with Functions on a Reproducing Kernel Hilbert Space


Since we need to solve the maximization problem in $D_{\mathrm{MMD},f,n}(r)$, evaluation of the
empirical MMD is costly in general because it requires an iterative algorithm. However, if the
elements of the class of functions $\mathcal{G}$ are represented by inner products with the
parameters, which includes the functions in a reproducing kernel Hilbert space (RKHS), the empirical
MMD can be efficiently calculated. Let
$k : \mathcal{V} \times \mathcal{Y} \times \mathcal{V} \times \mathcal{Y} \to \mathbb{R}$ be a
universal kernel, and let $\mathcal{H}$ be the RKHS induced by $k$. Let
$\Phi : \mathcal{V} \times \mathcal{Y} \to \mathcal{H}$ be the canonical feature map induced by $k$.

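Corollary 3. Suppose that
$\mathcal{G} = \{g \mid g(V, f(X)) = \langle \beta, \Phi(V, f(X)) \rangle_{\mathcal{H}},\ \|\beta\|_{\mathcal{H}} \le 1\}$.
Then, the empirical MMD is equivalent to

$$D_{\mathrm{MMD},f,n}(r) = \left\| \frac{1}{n(n-1)} \sum_{1 \le i \ne j \le n} \left[ \Phi(V_i, f(X_j)) - r(V_i, f(X_i))\, \Phi(V_i, f(X_i)) \right] \right\|_{\mathcal{H}}. \tag{9}$$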


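Proof of Corollary 3. From the definition of the MMD and $\mathcal{G}$, we have

$$D_{\mathrm{MMD},f,n}(r) = \sup_{\beta : \|\beta\|_{\mathcal{H}} \le 1} \frac{1}{n(n-1)} \sum_{1 \le i \ne j \le n} \left[ \langle \beta, \Phi(V_i, f(X_j)) \rangle_{\mathcal{H}} - r(V_i, f(X_i)) \langle \beta, \Phi(V_i, f(X_i)) \rangle_{\mathcal{H}} \right]$$

$$= \sup_{\beta : \|\beta\|_{\mathcal{H}} \le 1} \left\langle \beta, \frac{1}{n(n-1)} \sum_{1 \le i \ne j \le n} \left[ \Phi(V_i, f(X_j)) - r(V_i, f(X_i))\, \Phi(V_i, f(X_i)) \right] \right\rangle_{\mathcal{H}}.$$

Since the supremum is achieved if the direction of $\beta$ is equivalent to that of
$\frac{1}{n(n-1)} \sum_{1 \le i \ne j \le n} [\Phi(V_i, f(X_j)) - r(V_i, f(X_i))\, \Phi(V_i, f(X_i))]$,
we get the claim. □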
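
The closed form of Corollary 3, $D_{\mathrm{MMD},f,n}(r) = \| \frac{1}{n(n-1)} \sum_{i \ne j} [\Phi(V_i, f(X_j)) - r_i \Phi(V_i, f(X_i))] \|_{\mathcal{H}}$, can be evaluated with kernel evaluations only, because the norm of a weighted sum of feature vectors is a quadratic form in the kernel matrix. The following is a minimal sketch under stated assumptions: viewpoints and predictions are encoded as numbers, the kernels shown are illustrative choices, and all function names are hypothetical. The naive double loop is quadratic in the number of points, so it is meant for small samples only.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel on pairs (v, y); an illustrative universal kernel."""
    sq = sum((s - t) ** 2 for s, t in zip(a, b))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


def linear_kernel(a, b):
    """Linear kernel; its feature map is the identity, handy for checking."""
    return sum(s * t for s, t in zip(a, b))


def empirical_mmd(v, y_hat, r, kernel):
    """Norm of the weighted sum of feature vectors, computed via the kernel trick."""
    n = len(v)
    points, weights = [], []
    # Pairs (v_i, f(x_j)) with i != j emulate samples from Pr(V)Pr(f(X)).
    for i in range(n):
        for j in range(n):
            if i != j:
                points.append((v[i], y_hat[j]))
                weights.append(1.0 / (n * (n - 1)))
    # Joint samples (v_i, f(x_i)) are reweighted by the ratio r_i; summing the
    # second term over j != i leaves the factor 1/n.
    for i in range(n):
        points.append((v[i], y_hat[i]))
        weights.append(-r[i] / n)
    squared = sum(
        wa * wb * kernel(pa, pb)
        for pa, wa in zip(points, weights)
        for pb, wb in zip(points, weights)
    )
    return math.sqrt(max(squared, 0.0))  # guard against tiny negative round-off


v = [0.0, 1.0, 1.0, 0.0]
y_hat = [1.0, 1.0, 0.0, 1.0]
r = [1.0, 0.5, 2.0, 1.0]
print(empirical_mmd(v, y_hat, r, gaussian_kernel))
```

With the linear kernel the feature map is explicit, so the result can be compared against the directly computed norm of the averaged difference vector; this is a convenient sanity check before switching to a universal kernel.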


For simplicity of notation, we let $\Phi(v_i, f(x_j))$ be represented by $\Phi_{ij}$,
$r(v_i, f(x_i))$ by $r_i$, and
$k(v_i, f(x_j), v_k, f(x_l)) = \langle \Phi(v_i, f(x_j)), \Phi(v_k, f(x_l)) \rangle_{\mathcal{H}}$
by $k_{ijkl}$. Then, Eq. (9) can be rearranged as


4 I] 




2=1 


n(n — 1 ) 


E 


1 


^iijk 




n'^{n — 1)^ 


E 


^ijki' 


l<i^j,k^l<n 


( 10 ) 


Let $Q$ be a matrix such that $Q_{ij} = k_{iijj}$, and let $p$ be a vector such that
$p_i = \sum_{1 \le j \ne k \le n} k_{iijk}$. Let $r$ be the vector representation of $r_i$. The
matrix representation of the minimization of Eq. (10) is obtained as

$$\min_{r \ge 0} \frac{1}{2} r^\top Q r - p^\top r.$$

The minimizer of Eq. (10) with respect to $r$ can be easily obtained if $Q$ is a positive definite
matrix.
B Generalization Error Bound of Bartlett et al. [?]


Bartlett et al. [?] proved the following theorem for the generalization error bound based on
Bousquet's inequality:

Theorem 4 (Bartlett et al. [?]). Let
$\mathcal{H} = \{h : \mathcal{Y} \times \mathcal{X} \to \mathbb{R} \mid h(Y, X) = \ell(Y, f(X)) - \ell(Y, f^*(X)),\ f \in \mathcal{F}\}$,
and let $f_n \in \mathcal{F}$ be a hypothesis learned from the obtained set of samples $S_n$.
Suppose that $h(y, x) - h(y', x') \le c$ for any $h \in \mathcal{H}$, $y, y' \in \mathcal{Y}$, and
$x, x' \in \mathcal{X}$. Then, with
probability at least 1 — 

/ of At 

-+c— , 
n 6n 

where $\sigma^2_{\mathcal{H}} = \sup_{h \in \mathcal{H}} \mathrm{Var}[h(Y, X)]$.


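C Proofs

C.1 Proof of Theorem 1

In order to prove Theorem 1, we prove the following lemmas.

Lemma 1. Suppose that $c_\ell \le r^*(V, f(X)) \le c_u$ almost surely, where $c_\ell \in (0, 1]$
and $c_u \in [1, \infty)$. Then

$$\partial\phi(c_\ell) - \partial\phi(c_u) \le \frac{1}{2} \left[ \partial\phi(r^*(V, f(X)))\, r^*(V, f(X)) + \partial\phi(r^*(V', f(X')))\, r^*(V', f(X')) - \partial\phi(r^*(V', f(X))) - \partial\phi(r^*(V, f(X'))) \right] \le \partial\phi(c_u) c_u - \partial\phi(c_\ell) \quad \text{a.s.}$$

Proof. Since $\partial\phi$ is a non-decreasing function due to the convexity of $\phi$, we have

$$\partial\phi(c_\ell) \le \partial\phi(r^*(V', f(X))),\ \partial\phi(r^*(V, f(X'))) \le \partial\phi(c_u) \quad \text{a.s.} \tag{11}$$

From the assumption that the subdifferential of $\phi$ at 1 contains zero,
$\partial\phi(u) \le 0$ for $u \in (0, 1]$ and $\partial\phi(u) \ge 0$ for $u \in [1, \infty)$, so
that $\partial\phi(c_u)$ and $\partial\phi(c_\ell)$ are non-negative and non-positive,
respectively. By this fact and Eq. (11), we have

$$\partial\phi(c_\ell) \le \partial\phi(r^*(V, f(X)))\, r^*(V, f(X)) \le \partial\phi(c_u) c_u \quad \text{a.s.} \tag{12}$$

Combining Eqs. (11) and (12) gives the claim. □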

Lemma 2. Suppose that $c_\ell \le r^*(V, f(X)) \le c_u$ almost surely, where $c_\ell \in (0, 1]$
and $c_u \in [1, \infty)$. Then

$$-\max\{\phi(c_\ell), \phi(c_u)\} \le \mathbb{E}[\phi(r^*(V, f(X)))] - \phi(r^*(V, f(X))) \le \max\{\phi(c_\ell), \phi(c_u)\} \quad \text{a.s.}$$

Proof. From the assumption that the subdifferential of $\phi$ at 1 contains zero,
$\partial\phi(u) \le 0$ for $u \in (0, 1]$ and $\partial\phi(u) \ge 0$ for $u \in [1, \infty)$,
which yields that $\phi(u)$ is non-increasing in $(0, 1]$ and non-decreasing in $[1, \infty)$.
Therefore, $0 \le \phi(r^*(V, f(X))) \le \max\{\phi(c_u), \phi(c_\ell)\}$ a.s., which gives the
claim. □

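Proof of Theorem 1. The error $D_\phi(f) - D_{\phi,n}(f)$ is decomposed as

$$D_\phi(f) - D_{\phi,n}(f) = D_\phi(f) - D^*_{\phi,n}(f) + D^*_{\phi,n}(f) - D_{\phi,n}(f).$$

From the definition of the subdifferential, we have

$$\begin{aligned}
D^*_{\phi,n}(f) - D_{\phi,n}(f)
&= \frac{1}{n}\sum_{i=1}^{n}\big[\phi(r^*(V_i, f(X_i))) - \phi(r_n(V_i, f(X_i)))\big] \\
&\le \frac{1}{n}\sum_{i=1}^{n}\partial\phi(r^*(V_i, f(X_i)))\big(r^*(V_i, f(X_i)) - r_n(V_i, f(X_i))\big) \\
&= \frac{1}{n}\sum_{i=1}^{n}\partial\phi(r^*(V_i, f(X_i)))\, r^*(V_i, f(X_i)) - \frac{1}{n}\sum_{i=1}^{n}\partial\phi(r^*(V_i, f(X_i)))\, r_n(V_i, f(X_i)).
\end{aligned} \tag{13}$$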

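Since the first term in Eq. (13) is regarded as the empirical mean of $\partial\phi(r^*(V_i, f(X_i)))$ with respect to $\Pr(V, f(X))$ weighted by $r^*(V, f(X)) = \mathrm{d}\Pr(V)\Pr(f(X)) / \mathrm{d}\Pr(V, f(X))$, it is well approximated by the U-statistic for the expectation of $\partial\phi(r^*(V, f(X)))$ with respect to $\Pr(V)\Pr(f(X))$. We thus decompose the right-hand side of Eq. (13) using the U-statistic of $\partial\phi(r^*(V_i, f(X_j)))$ as

$$\begin{aligned}
D^*_{\phi,n}(f) - D_{\phi,n}(f)
&\le \frac{1}{n}\sum_{i=1}^{n}\partial\phi(r^*(V_i, f(X_i)))\, r^*(V_i, f(X_i)) - \frac{1}{n(n-1)}\sum_{1 \le i \ne j \le n}\partial\phi(r^*(V_i, f(X_j))) \\
&\quad + \frac{1}{n(n-1)}\sum_{1 \le i \ne j \le n}\big[\partial\phi(r^*(V_i, f(X_j))) - r_n(V_i, f(X_i))\,\partial\phi(r^*(V_i, f(X_i)))\big].
\end{aligned} \tag{14}$$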

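Since $\mathcal{G}$ contains $\partial\phi(r^*(V, f(X)))/a$, the last term in Eq. (14) is bounded above by the MMD as

$$\begin{aligned}
&\frac{1}{n(n-1)}\sum_{1 \le i \ne j \le n}\big[\partial\phi(r^*(V_i, f(X_j))) - r_n(V_i, f(X_i))\,\partial\phi(r^*(V_i, f(X_i)))\big] \\
&\quad \le a \sup_{g \in \mathcal{G}} \frac{1}{n(n-1)}\sum_{1 \le i \ne j \le n}\big[g(V_i, f(X_j)) - r_n(V_i, f(X_i))\, g(V_i, f(X_i))\big] \\
&\quad = a\, D_{\mathrm{MMD},f,n}(r_n).
\end{aligned}$$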

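Letting the first two terms in Eq. (14) be

$$U_{\phi,n}(f) = \frac{1}{n}\sum_{i=1}^{n}\partial\phi(r^*(V_i, f(X_i)))\, r^*(V_i, f(X_i)) - \frac{1}{n(n-1)}\sum_{1 \le i \ne j \le n}\partial\phi(r^*(V_i, f(X_j))),$$

the error is bounded above as

$$D_\phi(f) - D_{\phi,n}(f) \le D_\phi(f) - D^*_{\phi,n}(f) + U_{\phi,n}(f) + a\, D_{\mathrm{MMD},f,n}(r_n).$$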

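Next, we derive the probabilistic bound on $D_\phi(f) - D^*_{\phi,n}(f) + U_{\phi,n}(f)$. The expectation of $U_{\phi,n}(f)$ is equal to zero:

$$\begin{aligned}
\mathbb{E}[U_{\phi,n}(f)]
&= \mathbb{E}\Bigg[\frac{1}{n}\sum_{i=1}^{n}\partial\phi(r^*(V_i, f(X_i)))\, r^*(V_i, f(X_i)) - \frac{1}{n(n-1)}\sum_{1 \le i \ne j \le n}\partial\phi(r^*(V_i, f(X_j)))\Bigg] \\
&= \mathbb{E}[\partial\phi(r^*(V, f(X)))\, r^*(V, f(X))] - \mathbb{E}[\partial\phi(r^*(V', f(X)))] \\
&= \mathbb{E}[\partial\phi(r^*(V', f(X)))] - \mathbb{E}[\partial\phi(r^*(V', f(X)))] \\
&= 0.
\end{aligned} \tag{15}$$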


Since the almost sure bounds are established in Lemmas 1 and 2, applying the exponential inequality for U-statistics gives, with probability at least $1 - e^{-t}$,


<(/) - <n(/) + UMf) < E[<(/) 


<n(/)+C^ 0 ,n(/)]+C 
where c is a constant defined as in the claim. As shown in Eq. (15), $\mathbb{E}[U_{\phi,n}(f)] = 0$. In addition, since $D^*_{\phi,n}(f)$ is the unbiased estimator of $D_\phi(f)$, $\mathbb{E}[D^*_{\phi,n}(f)] = D_\phi(f)$. □
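
The zero-mean property in Eq. (15) rests on a change of measure: weighting by the ratio $r^*$ of the product of marginals to the joint turns an expectation under the joint distribution into one under the product distribution. The sketch below checks this exactly on a small discrete example. It is an illustration under assumptions: the joint table is invented for the example, every pair $(v, y)$ is assumed to have positive probability, and $\partial\phi(u) = \log u$ (the KL generator) is used as a stand-in for a general $\partial\phi$.

```python
import math


def dphi(u):
    """Assumed example: derivative of phi(u) = u*log(u) - u + 1."""
    return math.log(u)


def expected_u(joint):
    """Exact E[dphi(r*) r*] under the joint minus E[dphi(r*)] under the product.

    joint maps (v, y) -> probability, where y plays the role of f(X).
    All pairs in the product of the supports must be present with p > 0.
    """
    p_v, p_y = {}, {}
    for (v, y), p in joint.items():
        p_v[v] = p_v.get(v, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    # r*(v, y) = Pr(v) Pr(y) / Pr(v, y), the product-to-joint density ratio.
    ratio = {(v, y): p_v[v] * p_y[y] / p for (v, y), p in joint.items()}
    under_joint = sum(p * dphi(ratio[k]) * ratio[k] for k, p in joint.items())
    under_product = sum(p_v[v] * p_y[y] * dphi(ratio[(v, y)]) for (v, y) in joint)
    return under_joint - under_product


if __name__ == "__main__":
    example = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
    print(abs(expected_u(example)) < 1e-12)
```

The two sums agree term by term because $\Pr(v, y)\, r^*(v, y) = \Pr(v)\Pr(y)$, which is the identity used between the second and third lines of Eq. (15).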


C.2 Proof of Theorem 2 and Its Corollary

Proof of Theorem^ For any i, if 0{xi, k) — 9{xi,j) < 0 holds for any j G li Q {1, c} 

such that \Ii\ > 1, 9{xi, k) — 6{xi,j) is positive for all j because of the definition of 

max. Thus, this case violates the first constraint in Eq. since 1 lij — ^ 

require 7 ^ > 0 for some j. Therefore, for any i and j, ifmax/j^j 9{xi, k) — 9{xi,j) < 0, then 
maxk^j 9{xi, k) — 9{xi,p) > 0 for any p j because of the definition of max. This indicates that 
if 7 are feasible, then only one element of is one and the others are zero for each i. Since if 

maxk^j 9{xi, k) — 9{xi,j) < 0 holds then f{xi) = j, I[f{xi) = j] = jij. Thus, since the solution 
of Eq. (|^ holds the constraints in Eq. (|^, we get the claim. □ 

Proof of Corollary^ Since „( 7 , g) is linear with respect to 7 , max^gp g) is a convex 

function with respect to 7 . Since i?„(/) is convex with respect to 9 because of the convexity of I, 
the objective function in Eq. (j^ is convex with respect to / and 7 . In addition, max^^j 9[xi, k) — 
(^{xi,j) + 77 is convex with respect to 9 and 7 ^. Thus, all constraints are convex inequalities or 
linear equations. □ 


C.3 Proof of Theorem 3


For the proof of Theorem 3, we prove the following theorem.

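Theorem 5. Let $f^* \in \mathcal{F}$ be a hypothesis such that $R(f^*) = \inf_{f \in \mathcal{F}} R(f)$. Suppose that $\ell(y, f(x)) - \ell(y', f(x')) \le c$ for any $f \in \mathcal{F}$ and $(y, x), (y', x') \in \mathcal{Z}$. Then, with probability at least $1 - e^{-t}$,

$$\Big|R(f^*) - \inf_{f \in \mathcal{F}} R_n(f)\Big| \le 4\mathcal{R}_n(\ell \circ \mathcal{F}) + \sigma_{\ell \circ \mathcal{F}}\sqrt{\frac{2t}{n}} + c\,\frac{4t}{3n}.$$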


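Proof. Since $\sup_{f \in \mathcal{F}}(R(f^*) - R(f)) = \inf_{f \in \mathcal{F}} R(f) - \inf_{f \in \mathcal{F}} R(f) = 0$, we have

$$R(f^*) - \inf_{f \in \mathcal{F}} R_n(f) = \sup_{f \in \mathcal{F}}(R(f^*) - R_n(f)) \le \sup_{f \in \mathcal{F}}(R(f^*) - R(f)) + \sup_{f \in \mathcal{F}}(R(f) - R_n(f)) = \sup_{f \in \mathcal{F}}(R(f) - R_n(f)).$$

From the definition of sup, we have

$$\sup_{f \in \mathcal{F}}(R(f^*) - R_n(f)) \ge R(f^*) - R_n(f^*).$$

Hence, we have

$$\Big|\sup_{f \in \mathcal{F}}(R(f^*) - R_n(f))\Big| \le \max\Big\{\sup_{f \in \mathcal{F}}(R(f) - R_n(f)),\ \big|R(f^*) - R_n(f^*)\big|\Big\} \le \sup_{f \in \mathcal{F}}|R(f) - R_n(f)|.$$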

The bound on $\sup_{f \in \mathcal{F}}|R(f) - R_n(f)|$ is derived in the same manner as in the proof of Theorem 4. Application of Bousquet's inequality gives, with probability at least $1 - e^{-t}$,

$$\sup_{f \in \mathcal{F}}|R(f) - R_n(f)| \le \mathbb{E}\sup_{f \in \mathcal{F}}|R(f) - R_n(f)| + \sqrt{\frac{2tv}{n}} + \frac{ct}{3n},$$

where $v = 2c\,\mathbb{E}\sup_{f \in \mathcal{F}}|R(f) - R_n(f)| + \sigma^2_{\ell \circ \mathcal{F}}$. From the fact that $\sqrt{u + v} \le \sqrt{u} + \sqrt{v}$ and $2\sqrt{uv} \le au + v/a$ for $u, v, a > 0$, application of the symmetrization technique yields the claim. □
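
The deterministic step in the proof above reduces the gap between the best true risk and the best empirical risk to the uniform deviation over the class. A minimal sketch of that reduction for a finite hypothesis class follows; the risks are random numbers invented for illustration, and the function names are hypothetical.

```python
import random


def gap_and_uniform_deviation(true_risks, empirical_risks):
    """Return (|R(f*) - min_f R_n(f)|, max_f |R(f) - R_n(f)|) for a finite class."""
    best_true = min(true_risks)  # R(f*) = inf_f R(f)
    gap = abs(best_true - min(empirical_risks))
    deviation = max(abs(r - rn) for r, rn in zip(true_risks, empirical_risks))
    return gap, deviation


def check_reduction(trials=1000, size=20, seed=0):
    """Check that the gap never exceeds the uniform deviation on random classes."""
    rng = random.Random(seed)
    for _ in range(trials):
        true_risks = [rng.random() for _ in range(size)]
        # Empirical risks are the true risks plus bounded noise (an assumption).
        empirical_risks = [r + rng.uniform(-0.2, 0.2) for r in true_risks]
        gap, deviation = gap_and_uniform_deviation(true_risks, empirical_risks)
        if gap > deviation + 1e-12:
            return False
    return True


if __name__ == "__main__":
    print(check_reduction())
```

The check only covers the deterministic inequality; the probabilistic part of the proof (Bousquet's inequality and symmetrization) is not simulated here.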

Corollary 4. Let f* € Jv o hypothesis such that R{f*) = inf^gjr^ R-if)- Suppose that 
£{y,f{x)) — i{y', f{x')) < c for any f £ T and {y,x),{y',x') £ Z. Then, with probability at 
least 1 — e“* — e“’^ 



Proof. The proof follows the same manner of the proof of Theorem expect the upper bound of 
R{f*) ~ Rnifn)- From Theorem[^ we have with probability at least 1 — e~'^ 


R{f*) - Rnifn) < sup (Rif*) - Rnif)) < SUp {R{f) - Rn{f))- 




fFtFr 


□ 


Proof of Theorem 3. The error is bounded above as



Combining Theorem 4 and Corollary 4 gives the claim.


□ 